Statistics in Model

  • Metadata
    • tags: #statistics
  • For a project predicting the likelihood of collisions between trucks and other road users, the topic of confidence intervals came up
    • The [[bootstrap-algorithm]] technique can be used to quantify the confidence around regression coefficients and predictions
    • Implementation steps discussed with managers
      • Sample ~80% of the data as the training set
      • Normalize the input features (so that the coefficients of a linear regression are comparable with each other)
      • Fit the model on the training data and record the estimated coefficients
      • Repeat n times
      • Note the range of variability of each estimated coefficient and use it for variable selection
      • Apply the model, using the spread of estimated coefficients as the sample space to draw from
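    • A minimal sketch of the steps above, assuming NumPy and an ordinary least squares fit. The function names, the `replace` flag, and the synthetic data are illustrative assumptions, not the project's actual code. Resampling with replacement is the classic bootstrap; the notes only say "sample ~80%", so that choice is an assumption too.

```python
import numpy as np


def bootstrap_coefficients(X, y, n_rounds=500, train_frac=0.8, replace=True, seed=0):
    """Refit a linear regression on repeated ~80% samples.

    Returns the coefficients (intercept first) plus the per-round feature
    means and standard deviations used for normalization.
    """
    rng = np.random.default_rng(seed)
    n_rows, n_features = X.shape
    n_train = int(train_frac * n_rows)
    coefs = np.empty((n_rounds, n_features + 1))
    means = np.empty((n_rounds, n_features))
    stds = np.empty((n_rounds, n_features))
    for i in range(n_rounds):
        # Step 1: sample ~80% of the rows (with replacement by assumption).
        idx = rng.choice(n_rows, size=n_train, replace=replace)
        X_train, y_train = X[idx], y[idx]
        # Step 2: normalize the features so coefficients are comparable.
        means[i] = X_train.mean(axis=0)
        stds[i] = X_train.std(axis=0)
        stds[i][stds[i] == 0] = 1.0  # guard against constant columns
        Z = (X_train - means[i]) / stds[i]
        # Step 3: fit by least squares and record the coefficients.
        design = np.column_stack([np.ones(n_train), Z])
        coefs[i], *_ = np.linalg.lstsq(design, y_train, rcond=None)
    return coefs, means, stds


def percentile_interval(samples, level=0.95):
    """Percentile interval per column: returns (lower, upper) arrays."""
    tail = (1 - level) / 2 * 100
    return np.percentile(samples, [tail, 100 - tail], axis=0)


def predict_distribution(x_new, coefs, means, stds):
    """One prediction per bootstrap round for a single new observation."""
    Z = (np.asarray(x_new) - means) / stds
    return coefs[:, 0] + np.sum(Z * coefs[:, 1:], axis=1)


# Synthetic example: y depends on the first feature only.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 2))
y_demo = 3.0 + 2.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=200)

coefs, means, stds = bootstrap_coefficients(X_demo, y_demo, n_rounds=200)
lower, upper = percentile_interval(coefs)
preds = predict_distribution([1.0, 0.0], coefs, means, stds)
print("coefficient intervals:", np.round(lower, 3), np.round(upper, 3))
print("prediction interval:", np.round(percentile_interval(preds), 3))
```

    • In this sketch, a coefficient whose interval straddles zero (the second feature here) is a candidate to drop during variable selection, and the spread of `preds` gives a rough interval around a single prediction.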
  • Designing a statistical test
    • For a given scenario, we would like to test whether some intervention has caused an outcome to change
      • Null hypothesis: no observable change
      • Alternative hypothesis: observable change
    • There are four properties we care about: power, significance level, sample size, and effect size
      • These are interrelated, and we can solve for any one of them once the other three are fixed
    • Power ($1 - \beta$)
      • The probability of rejecting the null hypothesis when the alternative hypothesis is true
      • $\beta$ is the probability of a False Negative (Type II error)
    • Significance level ($\alpha$)
      • The probability of rejecting the null hypothesis when it is actually true (False Positive or Type I error)
      • This is usually set at 0.05 or lower, giving a confidence level ($1 - \alpha$) of 95% or higher
    • Sample size ($n$)
      • As the sample size increases, the power increases even if the significance level is held constant, because the variance of the estimated mean (the standard error) becomes smaller
    • Effect size ($e$)
      • The separation between the means of the two distributions
      • If the effect is small, increasing the power requires either a larger sample or a relaxed (larger) significance level
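    • A minimal sketch of how the four properties trade off, assuming a two-sided, two-sample z-test with equal group sizes and a standardized effect size (difference in means divided by the common standard deviation). These assumptions and the function names are illustrative; the notes do not specify a test. The power formula ignores the negligible rejection probability in the opposite tail.

```python
from math import ceil, sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, mean 0 and sd 1


def sample_size_per_group(effect_size, alpha=0.05, power=0.8):
    """Solve for n per group given effect size, alpha, and power."""
    z_alpha = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_power = _STD_NORMAL.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


def achieved_power(effect_size, n_per_group, alpha=0.05):
    """Solve for power given effect size, n per group, and alpha."""
    z_alpha = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    return _STD_NORMAL.cdf(effect_size * sqrt(n_per_group / 2) - z_alpha)


# A smaller effect needs a much larger sample for the same power.
print(sample_size_per_group(0.5))  # 63
print(sample_size_per_group(0.2))  # 393

# For a fixed effect, power rises with n or with a relaxed alpha.
print(round(achieved_power(0.5, 63), 3))
print(round(achieved_power(0.5, 100), 3))
print(round(achieved_power(0.5, 63, alpha=0.10), 3))
```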